Title of disclosure (in English) 

Method of Improving TTS Intelligibility in Long Passages 

Main Idea 

1. Background: What is the problem solved by your invention? Describe known solutions to this problem 
(if any). What are the drawbacks of such known solutions, or why is an additional solution required? Cite 
any relevant technical documents or references. 

Text-to-speech software ("TTS") has made vast improvements in the past 1-2 years. What used 
to be a serviceable but robotic-sounding system now mimics the human voice with great fidelity. 
But paradoxically, the increased fidelity leads to an increase in perceived faults - as the sound 
gets closer to that of a live human, all of its shortcomings come more clearly into view. 

And one of those limitations is its failure to hold the attention of a listener for long passages. 
While we plan to use TTS to play back news stories and long emails, its limited prosodic 
richness and monotonous tone present a barrier. When listening to a long passage, there are 
sections of great clarity punctuated by occasional words or word groups that are harder to 
understand, or that suffer from bumpy synthesis. These junctures present an increased cognitive 
load, and the listener must work harder to decipher what he or she has just heard. Meanwhile, 
the TTS marches on, so while the listener is working out the previous word the software is busy 
producing new ones. The end result is listener fatigue. It feels as though the TTS is being 
insensitive to the needs of the listener, whose mind ultimately begins to wander. 

There are no current solutions to this problem. 

2. Summary of Invention: Briefly describe the core idea of your invention (saving the details for questions 
#3 below). Describe the advantage(s) of using your invention instead of the known solutions described 
above. 

Let's look at a sentence from a news report: 

"'Bank of America tends to be a pretty good litmus test for the financial services sector as 
a whole,' said Doug Lister of Wachovia Securities, a financial services company." 

Most of this text will sound quite good coming out of the TTS engine. But when we get to the 
unfamiliar name "Doug Lister," we're on shaky territory. Did the TTS engine just say "Doug 
Lister," or was it "Doug Glister"? We've never heard either name, they're equally likely, and 
would sound pretty much the same. And while we're pondering that, the engine is generating still 
more words, until it gets to "Wachovia." Now we have to decode another word that we're not so 
sure about. Was that "Wockovious Securities," or "Wock Ovia Securities"? No, it was 
"Wachovia Securities." And with enough of these incidents, we begin to feel as though we're 
working too hard and we fall behind, ultimately missing some vital content. 

Live news readers compensate for this problem by slowing down slightly at unfamiliar words and 
by adding an almost imperceptible pause before and after the words. They may even allow 
themselves to sound slightly hesitant. This does two things: it signals the listener to pay extra 
attention, and it gives the listener some time to catch up. A live news reader would therefore 
slow down and pause slightly around "Doug Lister" and "Wachovia" when reading, "'Bank of 
America tends to be a pretty good litmus test for the financial services sector as a whole,' said 
Doug Lister of Wachovia Securities, a financial services company." 

While our current TTS systems don't truly "understand" the content of their speech to the point 
where we could program them to know what words to emphasize, some of these problem areas 
are in fact predictable and therefore lend themselves to software solutions. 



3. Description: Describe how your invention works, and how it could be implemented, using text, 
diagrams and flow charts as appropriate. 

What we're trying to do is to determine in advance which words or word pairs are likely to suffer 
from uneven synthesis. There are several metrics that can be employed. For example, the TTS 
System has a dictionary and can spot words that are not in it. The front end can recognize 
capitalization rules. Therefore, it can with some reliability detect unfamiliar proper names, which 
have a high likelihood of synthesis problems. So when one is detected, a very small pause can be 
added, and/or the word can be synthesized with longer durations. 
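
As a minimal sketch of the dictionary and capitalization check described above, the Python below 
flags capitalized words that are missing from a lexicon. The LEXICON set, the function name, and 
the choice to skip sentence-initial words (whose capitalization carries no information) are 
assumptions made for illustration, not details of the TTS System itself. 

```python
import re

# Hypothetical stand-in for the TTS pronunciation dictionary.
LEXICON = {
    "bank", "of", "america", "is", "a", "test", "said",
    "securities", "financial", "services", "company",
}

def find_unfamiliar_names(text, lexicon=LEXICON):
    """Return capitalized, out-of-dictionary words that are not sentence-initial."""
    flagged = []
    sentence_start = True
    for token in re.findall(r"[A-Za-z]+|[.!?]", text):
        if token in ".!?":
            # The next word starts a sentence, so its capital letter is uninformative.
            sentence_start = True
            continue
        is_capitalized = token[0].isupper()
        in_lexicon = token.lower() in lexicon
        if is_capitalized and not in_lexicon and not sentence_start:
            flagged.append(token)
        sentence_start = False
    return flagged

sentence = ("Bank of America is a test, said Doug Lister of "
            "Wachovia Securities, a financial services company.")
print(find_unfamiliar_names(sentence))
```

Words returned by such a check would then be candidates for the small pause and the longer 
durations. 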

We could also make use of a statistical language model trained on large amounts of text to spot 
low probability words and word sequences. Words or word sequences that receive a low 
probability score would be similarly treated with small pauses and longer durations. 
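
The language model idea can be sketched with a toy bigram model. This is only an illustration 
under stated assumptions: the add-one smoothing, the three-sentence training corpus, the 0.2 
threshold, and all function names are hypothetical, and a real system would use a model trained 
on large amounts of text. 

```python
from collections import Counter

def train_bigram_model(sentences):
    """Count bigrams and their left contexts over a list of training sentences."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        words = ["<s>"] + sentence.lower().split()
        vocab.update(words)
        contexts.update(words[:-1])
        bigrams.update(zip(words, words[1:]))
    # The extra 1 reserves probability mass for words never seen in training.
    return contexts, bigrams, len(vocab) + 1

def bigram_probability(prev, word, model):
    """Add-one smoothed estimate of P(word | prev)."""
    contexts, bigrams, vocab_size = model
    return (bigrams[(prev, word)] + 1) / (contexts[prev] + vocab_size)

def find_rare_words(sentence, model, threshold=0.2):
    """Return the words whose probability given the previous word is below threshold."""
    words = ["<s>"] + sentence.lower().split()
    return [
        word
        for prev, word in zip(words, words[1:])
        if bigram_probability(prev, word, model) < threshold
    ]

model = train_bigram_model(["said the bank", "said the bank", "said the sector"])
print(find_rare_words("said the wachovia", model))
```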

Other metrics are available to the system to predict when a difficult word or word pair has been 
encountered. We can monitor the cost function during the synthesis process. Splices exhibiting 
high cost are likely to lead to uneven synthesis, and the same treatment can be applied. 
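
One way to picture the cost monitoring is shown below. The sketch assumes that the synthesizer 
can report a list of splice costs for each word; the SynthesizedWord record, its field names, 
and the threshold value are hypothetical and stand in for whatever the synthesis search actually 
exposes. 

```python
from dataclasses import dataclass, field

@dataclass
class SynthesizedWord:
    text: str
    # Hypothetical per-splice costs reported by the synthesis search for this word.
    splice_costs: list = field(default_factory=list)

def flag_high_cost_words(words, cost_threshold=1.0):
    """Return the text of every word containing at least one splice above the threshold."""
    return [
        word.text
        for word in words
        if word.splice_costs and max(word.splice_costs) > cost_threshold
    ]

utterance = [
    SynthesizedWord("said", [0.1, 0.2]),
    SynthesizedWord("Wachovia", [0.4, 1.7, 0.3]),
    SynthesizedWord("a"),  # a word with no internal splices is never flagged
]
print(flag_high_cost_words(utterance))
```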


Segment matching offers a further metric for predicting a difficult word or word pair. If the 
quality of the segment match falls below some threshold, we can assume this may lead to uneven 
synthesis, and the same treatment can be applied. 

False positives are no cause for concern. If the occasional well-synthesized word is played a little 
slower, this won't sound abnormal. But if we can catch a reasonable percentage of rough 
synthesis and treat it with kid gloves through the strategic application of pauses and duration 
control, we can greatly increase our overall comprehension. 

To summarize: When the TTS System encounters a section of low confidence or unknown words, 
it will add pauses and increase durations. 
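
The treatment itself can be sketched as a simple transformation of a word timeline. The function 
name, the stretch factor, and the pause length below are assumptions chosen for illustration; 
suitable values would have to be found by listening tests. 

```python
def apply_careful_delivery(word_durations_ms, flagged, stretch=1.15, pause_ms=60):
    """Return a timeline of (label, duration_ms) with flagged words slowed and padded.

    word_durations_ms: list of (word, duration in milliseconds) from the synthesizer.
    flagged: set of words judged likely to be hard to understand.
    """
    timeline = []
    for word, duration in word_durations_ms:
        if word in flagged:
            timeline.append(("<pause>", pause_ms))              # short pause before
            timeline.append((word, round(duration * stretch)))  # longer duration
            timeline.append(("<pause>", pause_ms))              # short pause after
        else:
            timeline.append((word, duration))
    return timeline

plan = apply_careful_delivery(
    [("of", 120), ("Wachovia", 600), ("Securities", 700)],
    flagged={"Wachovia"},
    stretch=1.5,
    pause_ms=50,
)
print(plan)
```

Unflagged words pass through unchanged, so a false positive costs only a slightly slower word. 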



How text would be marked up by Rare Sequence Detection 

Input text:                                    Hello, Mrs. Wisniewski 
Normalized text:                               Hello P0 missus wisnefsky 
Normalized text plus rare sequence detector:   Hello P0 missus P1 <rare> wisnefsky </rare> 
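
The markup step shown above could be produced along the following lines. This is a sketch only: 
the function name is hypothetical, the P1 marker and the <rare> tags are copied from the example, 
and the decision to wrap a run of adjacent rare words in a single pair of tags is an assumption. 

```python
def mark_rare_sequences(tokens, rare_words, pause_marker="P1"):
    """Insert a pause marker and <rare> tags around each run of rare tokens."""
    marked = []
    in_rare_run = False
    for token in tokens:
        is_rare = token in rare_words
        if is_rare and not in_rare_run:
            marked.extend([pause_marker, "<rare>"])  # open a new rare run
        elif not is_rare and in_rare_run:
            marked.append("</rare>")                 # close the run just ended
        marked.append(token)
        in_rare_run = is_rare
    if in_rare_run:
        marked.append("</rare>")                     # close a run that ends the text
    return " ".join(marked)

print(mark_rare_sequences("missus wisnefsky".split(), {"wisnefsky"}))
# prints: missus P1 <rare> wisnefsky </rare>
```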
Method of Improving TTS Intelligibility in Long Passages 
By Andy Aaron and Ellen Eide 

How it works 